This is our last machine learning lecture, in which we will give an overview of the different algorithms seen in the course.
We will go through the following map of algorithms from the course.
To apply supervised learning, we define a dataset and a learning algorithm.
$$ \underbrace{\text{Dataset}}_\text{Features, Attributes, Targets} + \underbrace{\text{Learning Algorithm}}_\text{Model Class + Objective + Optimizer } \to \text{Predictive Model} $$The output is a predictive model that maps inputs to targets. For instance, it can predict targets on new inputs.
In linear regression, we fit a model $$ f_\theta(x) := \theta^\top \phi(x) $$ that is linear in $\theta$.
The features $\phi(x) : \mathbb{R} \to \mathbb{R}^p$ may be non-linear in $x$ (e.g., polynomial features), allowing us to fit complex functions.
We define the least squares objective for the model as: $$ J(\theta) = \frac{1}{2} \sum_{i=1}^n (y^{(i)} - \theta^\top x^{(i)})^2 = \frac{1}{2} (X \theta - y)^\top (X \theta - y) $$
We can set the gradient to zero to obtain the normal equations: $$ (X^\top X ) \theta = X^\top y. $$
Hence, the value $\theta^*$ that minimizes this objective is given by: $$ \theta^* = (X^\top X)^{-1} X^\top y.$$
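As a quick illustration, here is a minimal NumPy sketch that solves the normal equations on synthetic data; the data, random seed, and "true" coefficients below are made up for illustration.

```python
import numpy as np

# synthetic 1D regression data (illustrative only)
np.random.seed(0)
x = np.random.uniform(-1, 1, size=20)
y = 2.0 * x + 0.5 + 0.1 * np.random.randn(20)

# design matrix with a bias column: phi(x) = [1, x]
X = np.stack([np.ones_like(x), x], axis=1)

# solve the normal equations (X^T X) theta = X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # close to [0.5, 2.0]
```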
Overfitting is one of the most common failure modes of machine learning.
We can measure overfitting and underfitting by estimating accuracy on held-out data and comparing it to the accuracy on the training data.
We will see many ways of dealing with overfitting, but here are some ideas:
The idea of regularization is to penalize complex models that may overfit the data.
Regularized least squares optimizes the following objective (Ridge). $$ J(\theta) = \frac{1}{2n} \sum_{i=1}^n \left( y^{(i)} - \theta^\top x^{(i)} \right)^2 + \frac{\lambda}{2} \cdot ||\theta||_2^2. $$ If we use the L1 norm, we have the LASSO.
To find optimal Ridge parameters, we can set the gradient to zero to obtain the normal equations: $$ (X^\top X + \lambda I) \theta = X^\top y. $$
Hence, the value $\theta^*$ that minimizes this objective is given by: $$ \theta^* = (X^\top X + \lambda I)^{-1} X^\top y.$$
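The same pattern works for ridge regression. The following is a short sketch on made-up data; the design matrix, true coefficients, and the value of $\lambda$ are all illustrative assumptions.

```python
import numpy as np

# illustrative data: 50 examples, 5 features, sparse true coefficients
np.random.seed(0)
X = np.random.randn(50, 5)
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 3.0]) + 0.1 * np.random.randn(50)

lam = 0.1          # regularization strength lambda
d = X.shape[1]

# ridge normal equations: (X^T X + lambda I) theta = X^T y
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(theta_ridge)
```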
Consider a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$.
We distinguish between two types of supervised learning problems depending on the targets $y^{(i)}$.
Logistic regression fits models of the form: $$ f(x) = \sigma(\theta^\top x) = \frac{1}{1 + \exp(-\theta^\top x)}, $$ where $$ \sigma(z) = \frac{1}{1 + \exp(-z)} $$ is known as the sigmoid or logistic function.
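As a small illustration, here is a sketch of the sigmoid and of how a fitted logistic model would score an input; the parameter vector and the input below are made-up values rather than fitted ones.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# made-up parameters and input for illustration
theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.3, 0.8])   # the first component plays the role of a bias feature

p = sigmoid(theta @ x)          # P(y = 1 | x)
print(p, int(p > 0.5))          # predicted probability and class
```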
At inference time, we receive a query point $x'$ and we want to predict its label $y'$.
Nearest neighbors is an example of a non-parametric model.
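For example, here is a minimal scikit-learn sketch of nearest-neighbors prediction on the Iris data; the choice of `n_neighbors=3` and the query point are arbitrary illustrations.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# fit a 3-nearest-neighbors classifier (no parameters are learned; the data itself is stored)
X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# predict the label y' of a query point x'
x_query = [[5.0, 3.5, 1.5, 0.2]]   # an illustrative query point
print(knn.predict(x_query))
```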
Many supervised learning models have a probabilistic interpretation. Often a model $f_\theta$ defines a probability distribution of the form
\begin{align*} P_\theta(x,y) : \mathcal{X} \times \mathcal{Y} \to [0,1] && \text{or} && P_\theta(y|x) : \mathcal{X} \times \mathcal{Y} \to [0,1]. \end{align*}We refer to these as probabilistic models.
For example, our logistic model defines ("parameterizes") a probability distribution $P_\theta(y|x) : \mathcal{X} \times \mathcal{Y} \to [0,1]$ as follows:
\begin{align*} P_\theta(y=1 | x) & = \sigma(\theta^\top x) \\ P_\theta(y=0 | x) & = 1-\sigma(\theta^\top x). \end{align*}There are two types of probabilistic models: generative and discriminative. \begin{align*} \underbrace{P_\theta(x,y) : \mathcal{X} \times \mathcal{Y} \to [0,1]}_\text{generative model} & \;\; & \underbrace{P_\theta(y|x) : \mathcal{X} \times \mathcal{Y} \to [0,1]}_\text{discriminative model} \end{align*}
We can obtain predictions from generative models via $\max_y P_\theta(x,y)$.
We choose the class whose model fits $x$ best.
Gaussian Discriminant Analysis defines a model $P_\theta$ as follows: the class prior is categorical, $P_\theta(y=k) = \phi_k$, and each class-conditional distribution is Gaussian, $P_\theta(x | y=k) = \mathcal{N}(x; \mu_k, \Sigma_k)$.
Bernoulli Naive Bayes defines a model $P_\theta$ as follows: the class prior is again categorical, $P_\theta(y=k) = \phi_k$, and the features are assumed conditionally independent given the class, $P_\theta(x | y=k) = \prod_{j=1}^d P_\theta(x_j | y=k)$, where each $P_\theta(x_j = 1 | y=k) = \psi_{jk}$ is a Bernoulli distribution.
We can learn a generative model $P_\theta(x, y)$ by maximizing the likelihood of the data:
$$ \max_\theta \frac{1}{n}\sum_{i=1}^n \log P_\theta({x}^{(i)}, y^{(i)}). $$This seeks to find parameters $\theta$ such that the model assigns high probability to the training data.
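As a toy illustration of this principle (with made-up observations), the maximum-likelihood estimate of a single Bernoulli parameter is simply the empirical frequency, and any other value gives a lower average log-likelihood.

```python
import numpy as np

# made-up binary observations
y = np.array([1, 0, 1, 1, 0, 1, 1, 1])

def avg_log_likelihood(p, y):
    """Average log-probability of the data under a Bernoulli(p) model."""
    return np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

p_mle = y.mean()   # the maximum-likelihood estimate
print(p_mle, avg_log_likelihood(p_mle, y))
print(avg_log_likelihood(0.5, y))   # any other p scores lower
```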
Logistic regression and Naive Bayes are both linear classifiers. However, they yield different decision boundaries.
Intuitively, we want to select linear decision boundaries with high margin: every point is as far as possible from the decision boundary.
import numpy as np
import pandas as pd
from sklearn import datasets
# Load the Iris dataset
iris = datasets.load_iris(as_frame=True)
iris_X, iris_y = iris.data, iris.target
# subsample every fourth data point
iris_X = iris_X.loc[::4]
iris_y = iris_y.loc[::4]
# create a binary classification dataset with labels +/- 1
iris_y2 = iris_y.copy()
iris_y2[iris_y2==2] = 1
iris_y2[iris_y2==0] = -1
# print part of the dataset
pd.concat([iris_X, iris_y2], axis=1).head()
# https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [12, 4]
import warnings
warnings.filterwarnings("ignore")
# create 2d version of the dataset using the first two features
X = iris_X.to_numpy()[:,:2]
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, .02), np.arange(y_min, y_max, .02))
# Plot also the training points
p1 = plt.scatter(X[:, 0], X[:, 1], c=iris_y2, s=60, cmap=plt.cm.Paired)
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.legend(handles=p1.legend_elements()[0], labels=['Setosa', 'Not Setosa'], loc='lower right')
from sklearn.linear_model import Perceptron, RidgeClassifier
from sklearn.svm import SVC
models = [SVC(kernel='linear', C=10000), Perceptron(), RidgeClassifier()]
def fit_and_create_boundary(model):
    model.fit(X, iris_y2)
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    return Z
plt.figure(figsize=(12,3))
for i, model in enumerate(models):
    plt.subplot(1, 3, i + 1)
    Z = fit_and_create_boundary(model)
    plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)
    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=iris_y2, edgecolors='k', cmap=plt.cm.Paired)
    if i == 0:
        plt.title('Good Margin')
    else:
        plt.title('Bad Margin')
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')
plt.show()
Support Vector Machines fit linear models of the form
\begin{align*} f_\theta(x) = \theta^\top \phi(x) + \theta_0 \end{align*}so as to find the maximum-margin hyperplane.
Recall that the max-margin hyperplane can be formulated as the solution to the following primal optimization problem. \begin{align*} \min_{\theta,\theta_0, \xi}\; & \frac{1}{2}||\theta||^2 + C \sum_{i=1}^n \xi_i \; \\ \text{subject to } \; & y^{(i)}((x^{(i)})^\top\theta+\theta_0)\geq 1 - \xi_i \; \text{for all $i$} \\ & \xi_i \geq 0 \end{align*}
We can turn our optimization problem into an unconstrained form: $$ \min_{\theta,\theta_0}\; \sum_{i=1}^n \underbrace{\left(1 - y^{(i)}\left((x^{(i)})^\top\theta+\theta_0\right)\right)^+}_\text{hinge loss} + \underbrace{\frac{\lambda}{2}||\theta||^2}_\text{regularizer} $$
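To make the unconstrained form concrete, here is a small NumPy sketch that evaluates the hinge-loss objective; the points, labels, and parameter values are made up for illustration.

```python
import numpy as np

def svm_objective(theta, theta0, X, y, lam=1.0):
    """Hinge loss plus L2 regularizer for labels y in {-1, +1} (illustrative sketch)."""
    margins = y * (X @ theta + theta0)
    hinge = np.maximum(0.0, 1.0 - margins).sum()
    return hinge + 0.5 * lam * theta @ theta

# tiny made-up dataset
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.5]])
y = np.array([1.0, 1.0, -1.0])
print(svm_objective(np.array([0.5, 0.5]), 0.0, X, y))
```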
The solution to the primal problem also happens to be given by the following dual problem: \begin{align*} \max_{\lambda} & \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^\top x^{(k)} \\ \text{subject to } \; & \sum_{i=1}^n \lambda_i y^{(i)} = 0 \\ & C \geq \lambda_i \geq 0 \; \text{for all $i$} \end{align*}
Many algorithms in machine learning only involve dot products $\phi(x)^\top \phi(z)$ but not the features $\phi$ themselves.
We can often compute $\phi(x)^\top \phi(z)$ very efficiently for complex $\phi$ using a kernel function $K(x,z) = \phi(x)^\top \phi(z)$. This is the kernel trick.
Notice that in the SVM dual, the features $\phi(x)$ are never used directly. Only their dot product is used.
\begin{align*} J(\lambda) &= \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} \phi(x^{(i)})^\top \phi(x^{(k)}) \end{align*}If we can compute the dot product efficiently, we can potentially use very complex features.
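As a concrete check of this idea, the sketch below compares an explicit degree-2 feature map with the corresponding polynomial kernel $K(x,z) = (x^\top z)^2$; the input vectors are arbitrary examples.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 features: all pairwise products x_i * x_j."""
    return np.outer(x, x).ravel()

def K(x, z):
    """Degree-2 polynomial kernel (x^T z)^2, computed without building phi."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0, -1.0])
z = np.array([0.5, -1.0, 3.0])

# the kernel and the explicit feature dot product agree
print(phi(x) @ phi(z), K(x, z))
```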
Decision trees output targets based on a tree of human-interpretable decision rules.
Neural network models are inspired by the brain.
We have a dataset without labels. Our goal is to learn something interesting about the structure of the data:
The problem of density estimation is to approximate the data distribution $P_\text{data}$ with the model $P$. $$ P \approx P_\text{data}. $$
It's also a general learning task. We can solve many downstream tasks using a good model $P$:
In kernel density estimation, we fit a model of the form $$P(x) \propto \sum_{i=1}^n K(x, x^{(i)}).$$ One example is the Gaussian kernel $K(x,z; \delta) \propto \exp(-||x-z||^2/2\delta^2)$.
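Scikit-learn's `KernelDensity` implements this kind of model; the following sketch fits a Gaussian-kernel estimate to made-up one-dimensional samples (the data and the bandwidth $\delta = 0.5$ are illustrative choices).

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# made-up samples from a bimodal distribution
np.random.seed(0)
x = np.concatenate([np.random.normal(-2, 0.5, 100),
                    np.random.normal(3, 1.0, 100)])[:, None]

# fit a kernel density estimate with a Gaussian kernel of bandwidth 0.5
kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(x)

# score_samples returns log P(x); exponentiate to get densities
query = np.array([[-2.0], [0.0], [3.0]])
print(np.exp(kde.score_samples(query)))
```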
We can perform clustering via density estimation with a GMM model. $$P_\theta (x,z) = P_\theta (x | z) P_\theta (z)$$
The parameters $\theta$ are the $\mu_k, \Sigma_k, \phi_k$ for all $k=1,2,\ldots,K$.
Intuitively, a GMM represents well the two clusters in the geyser dataset:
(Figure: three panels comparing the raw geyser data, a single Gaussian fit, and a mixture of Gaussians fit.)
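Scikit-learn's `GaussianMixture` fits such a model; here is a sketch on synthetic two-cluster data standing in for the geyser dataset (the cluster means and covariances below are made up).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# synthetic two-cluster data standing in for the geyser dataset
np.random.seed(0)
X = np.vstack([np.random.multivariate_normal([2.0, 55.0], [[0.1, 0.0], [0.0, 30.0]], 100),
               np.random.multivariate_normal([4.3, 80.0], [[0.2, 0.0], [0.0, 40.0]], 150)])

# fit a mixture of K = 2 Gaussians; theta = (mu_k, Sigma_k, phi_k)
gmm = GaussianMixture(n_components=2).fit(X)
print(gmm.means_)     # the mu_k
print(gmm.weights_)   # the phi_k

# clustering: assign each point to its most likely mixture component z
print(gmm.predict(X[:5]))
```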
Suppose $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Z} = \mathbb{R}^p$ for some $p < d$. The transformation $$f_\theta : \mathcal{X} \to \mathcal{Z}$$ is a linear function with parameters $\theta = W \in \mathbb{R}^{d \times p}$: $$ z = f_\theta(x) = W^\top \cdot x. $$ The latent dimension $z$ is obtained from $x$ via a matrix $W$.
We find $W$ using one of the following equivalent objectives (figure credit: Alex Williams)
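In practice, $W$ can be recovered with scikit-learn's `PCA`. The sketch below projects made-up $d=4$ dimensional data down to $p=2$ latent dimensions; the data-generating scales are arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

# made-up data in d = 4 dimensions, with most variance in the first two directions
np.random.seed(0)
X = np.random.randn(200, 4) @ np.diag([3.0, 2.0, 0.5, 0.1])

# learn a p = 2 dimensional linear projection (PCA also centers the data)
pca = PCA(n_components=2).fit(X)
Z = pca.transform(X)

# pca.components_ holds W^T (shape p x d)
print(pca.components_.shape, Z.shape)   # (2, 4) (200, 2)
```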
One factor is how much data you have. In the small data (<10,000 examples) regime, consider:
In the big data regime,
Some additional advice:
Consider the following courses to keep learning about ML:
In order to get involved in research, I recommend:
Finally, a few ideas for how to get more practice applying ML in the real world: